Yalda Data Analysis
Introduction
Hello Yalda, this data has been fun to analyse thank you. I have split this up into segments and have included my code in the clickable drop down menus. If you have any questions about the code just let me know, but in each section I should have explained what it is I am trying to do.
Step 1: Read in the raw data and check data
So to do this I have edited your data on excel and will send that file to you so you can work with the same original data set. This involved removing the top two rows, any of them random columns that probably means something but I don’t really care about them. Also the shit at the bottom is gone, i think these are just proteins that some peptides were identified but maybe below a cut-off.
# Read in data frame
df <- read.csv("Yalda_ligament_data.csv") %>%
clean_names()
# Now want to see how many proteins there are
print(paste("Number of proteins =", dim(df)[1]-1))## [1] "Number of proteins = 1813"
# Visualise the number of missing values per sample
df %>%
melt() %>%
group_by(variable) %>%
summarise(count = sum((value == 0)/(dim(df)[1]-1)*100)) %>%
ggplot(aes(x = variable, y = count))+
geom_bar(aes(fill = count), stat = "identity", col = "black")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.27))+
labs(y = "Percent of Missing Values", x = "Sample ID", fill = "")+
scale_fill_continuous(type = "viridis")So the main thing about this is you have a lot of proteins and not too many missing values. Of note one sample has a lot more missing values than the rest (HC 21) but we can ignore this for now and then move back to it later if we see some strange results (Outlier on PCA)
Step 2: Remove proteins with too many missing values
The problems with proteins that have too many missing values is that any comparisons you make with them are going to be difficult to analyse and lack any statistical power. Also with the use of imputation methods you are potentially going to make up an effect.
df_check_missing <- df %>%
melt() %>%
mutate(group = case_when(str_detect(variable, "dc")~"disease_control",
str_detect(variable, "dm")~"disease_mimic",
str_detect(variable, "da")~"disease_antag",
str_detect(variable, "hc")~"healthy_control",
str_detect(variable, "hm")~"healthy_mimic",
str_detect(variable, "ha")~"healthy_antag")) %>%
group_by(accession, group) %>%
summarise(count = sum(value == 0))## Using accession as id variables
## `summarise()` has grouped output by 'accession'. You can override using the
## `.groups` argument.
proteins_removed <- df_check_missing %>%
filter(count >= 3)
df_removed <- df %>%
filter(accession %in% proteins_removed$accession) %>%
dplyr::rename(Accession = accession)
datatable(df_removed,
caption = "Table of proteins removed from analysis",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Removal of these proteins can potentially mean losing information from your analysis but we are only losing 56 proteins from a total 1813 so its not a massive issue. If you want to spend time looking at these you can, I would take the accession codes and do a quick UniProt ID mapping and look at the proteins and see if there any that are relevant, if you see one that is then you can maybe leave that protein in the analysis but this is a bit dodgy so I wouldn’t recommend it.
Step 3: Now we need to compare normalisation strategies
So i am using an Rpackage to do this called NormalyzerDE, this comes recommended from CBF. Basically you normalise data based on many metrics and you see which is best based on that (very confusing). Very simply put it gives you an argument as to why you used a certain method.
To do this I need to make some metadata and also some data wrangling is needed.
df_for_normal <- df %>%
filter(!accession %in% df_removed$Accession) %>%
column_to_rownames(var = 'accession') %>%
mutate_all(~ ifelse(. == 0, NA, .))
design <- data.frame(sample = colnames(df_for_normal),
group = rep(c("disease_control", "disease_mimic", "disease_antag", "healthy_control", "healthy_mimic", "healthy_antag")))
sum_obj <- SummarizedExperiment(as.matrix(df_for_normal+1),
colData = design,
rowData = rownames(df_for_normal))
# This is the code it wont work when making this document
#normalyzer(jobName = "Yalda_normalidation",
#experimentObj = sum_obj,
#outputDir = "")
Step 4: Normalise the data (Quantile normalisation)
So based on these metrics I went for quantile normalisation, you could also argue median is okay for this but i was slightly worried about the MA plot. If I’m being honest, i do this step but don’t actually fully understand the output. But when people asked I used statistical parameters to observe the efficacy of eight normalisation methods and chose a representative method.
Quantile normalisation definition: It makes the assumption that the data in different samples should originate from an identical distribution. It does this by generating a reference distribution and then scaling the other samples accordingly.
df_normalised <- performQuantileNormalization(rawMatrix = as.matrix(df_for_normal)) %>%
as.data.frame()
rownames(df_normalised) <- rownames(df_for_normal)
datatable(df_normalised,
caption = "Normalised dataset",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = T)Step 5: Test some imputation methods
Right so this is a bit complicated. To do this I basically manually make values missing and then impute them. Then I calculate the difference between the two. The smaller that difference the better. So the methods I test are;
I’m gonna get round to this but the clustering takes a while
Step 6: Principle component analysis with imputed data and also other clustering analyis
Step 7: Clustering analysis without imputation
We can do clustering analysis without imputation and just consider missing values as 0. This will probably be okay considering you don’t have that many missing values but we wont know until we can compare both.
# change the NA to 0
df_PCA <- df_normalised
df_PCA[is.na(df_PCA)] <- 0
# do the PCA
PCA_noimp <- prcomp(t(df_PCA), scale. = T)
PCA_noimp_df <- PCA_noimp$x %>%
as.data.frame() %>%
rownames_to_column(var = "accession") %>%
mutate(group = case_when(str_detect(accession, "dc")~"disease_control",
str_detect(accession, "dm")~"disease_mimic",
str_detect(accession, "da")~"disease_antag",
str_detect(accession, "hc")~"healthy_control",
str_detect(accession, "hm")~"healthy_mimic",
str_detect(accession, "ha")~"healthy_antag"))
PCA_sum <- summary(PCA_noimp)
# Plot with all
PCA_noimp_df %>%
ggplot(aes(x = PC1, y = PC2))+
geom_point(aes(fill = group),
stroke = 1, shape = 21, size = 4)+
stat_ellipse(aes(fill = group),
geom = "polygon",
alpha = 0.4,
col = "black")+
theme(axis.title = element_text(size = 15),
legend.text = element_text(size = 15),
legend.title = element_text(size = 18),
axis.text = element_text(size = 15))+
labs(x = paste0("PC1: ",
round(PCA_sum$importance[2,1]*100,
digits = 2),
"% Variance explained"),
y = paste0("PC2: ",
round(PCA_sum$importance[2,2]*100,
digits = 2),
"% Variance explained"))+
geom_label_repel(aes(label = accession))## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Plot without
filtered_pcaplot <- PCA_noimp_df %>%
filter(!accession %in% c("x080224_acl_21_hc", "x080224_acl_3_dc"))
filtered_pcaplot %>%
ggplot(aes(x = PC1, y = PC2))+
geom_point(aes(fill = group),
stroke = 1, shape = 21, size = 4)+
theme(axis.title = element_text(size = 15),
legend.text = element_text(size = 15),
legend.title = element_text(size = 18),
axis.text = element_text(size = 15))+
labs(x = paste0("PC1: ",
round(PCA_sum$importance[2,1]*100,
digits = 2),
"% Variance explained"),
y = paste0("PC2: ",
round(PCA_sum$importance[2,2]*100,
digits = 2),
"% Variance explained"))So looks like theres two points that look a bit weird. Number 21 (Healthy control) and Number 3 (Diseased control). It may be worth going back and looking at those samples to see if there was anything different about them.
Other than those it doesnt look like theres too much seperation between the points.
Also plotted it in 3 dimensions to see if there was any difference.
colours <- c("#F8766D", "#B79F00", "#00BA38", "#00BFC4", "#619CFF", "#F564E3")
plot_ly(x = filtered_pcaplot$PC1,
y = filtered_pcaplot$PC2,
z = filtered_pcaplot$PC3,
type = "scatter3d",
mode = "markers",
marker = list(size = 4,
color = colours[as.numeric(factor(filtered_pcaplot$group))]),
color = factor(filtered_pcaplot$group)) %>%
layout(title = "", xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'), legend = list(title=list(text='<b> Type of tissue </b>')))Yeah this didn’t really show anything, i’m gonna leave it in cause it looks cool and is fun to move around but looks like samples are quite similar. This isn’t a bad thing as should just increase the significance of things being different.
Step 8: Differential expression analysis
Okay so this is the best bit, what I am going to do is the comparisons we mentioned when we were discussing them, i am going to firstly import gene names into the dataset instead of accession codes as this will be easier to use, however when I give you the differential expression tables I will make sure accession, gene name and protein name are included if available.
To get the gene names I am using uniprot ID mapping.
accession_codes <- rownames(df_normalised)
accession_codes <- unlist(lapply(strsplit(accession_codes, ";"), function(x)x[1]))
print(paste("Number of duplicated accession codes =", anyDuplicated(accession_codes)))## [1] "Number of duplicated accession codes = 0"
mappings <- read.csv("Uniprotmappings.csv") %>%
clean_names() %>%
mutate(gene_names = case_when(gene_names == ""~entry,
gene_names != ""~gene_names))
datatable(mappings,
caption = "Gene name mappings",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
Sum_obj_final <- SummarizedExperiment(as.matrix(df_normalised),
colData = design,
rowData = rownames(df_normalised))
# This is the code used for diff expression, as i said before it doesnt work with this document so i did it seperately
#normalyzerDE(jobName = "Yalda_diff_expr",
# experimentObj = Sum_obj_final,
# outputDir = "../../Yalda_Ligament/",
# comparisons = c("disease_control-healthy_control",
# "disease_antag-disease_control",
# "disease_mimic-disease_control",
# "healthy_antag-healthy_control",
# "healthy_mimic-healthy_control",
# "disease_mimic-healthy_mimic",
# "disease_antag-healthy_antag"),
# logTrans = F,
# type = "limma")
diff_rep <- read_tsv(file = "Yalda_diff_expr/Yalda_diff_expr_stats.tsv") %>%
mutate(X = unlist(lapply(strsplit(X, ";"), function(x)x[1])))## Rows: 1758 Columns: 53
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X
## dbl (52): disease_control-healthy_control_PValue, disease_antag-disease_cont...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now want to make my life easier by making a function that works quickly
make_volcano <- function(dataframe, group1, group2, fold_change, pvalue, mappings_df){
# Make fold change column
fold_change_col <- paste0(group1, "-", group2,"_", "log2FoldChange")
# Make p value column
pval_col <- paste0(group1, "-", group2,"_", "PValue")
# Make dataframe
diff_rep_2 <- dataframe %>%
select(X, !!sym(fold_change_col), !!sym(pval_col)) %>%
dplyr::rename(FC = !!sym(fold_change_col),
pval = !!sym(pval_col)) %>%
mutate(regulation = case_when(
FC <= -fold_change & pval <= pvalue ~ "Down",
FC >= fold_change & pval <= pvalue~"Up",
FC < fold_change & FC > -fold_change & pval <= pvalue~"Same",
pval > pvalue~"Non-Sig"),
regulation = factor(regulation, levels = c("Up", "Down", "Same", "Non-Sig"))) %>%
mutate(Gene_name = mappings_df$gene_names,
Protein_name = mappings_df$protein_names)
number_updownsamenon <- length(unique(diff_rep_2$regulation))
if (number_updownsamenon == 4){
volcplot <- diff_rep_2 %>%
ggplot(aes(x = FC, y = -log10(pval)))+
geom_point(aes(col = regulation), size = 3, alpha = 0.4)+
labs(x = paste0("log2(",group1, "/",group2, ")"),
y = '-log10(P-Value)',
title = paste0("Volcano Plot: ", group1, "/", group2),
fill = "Regulation")+
theme_light()+
geom_vline(xintercept = 1)+
geom_vline(xintercept = -1)+
geom_hline(yintercept = 1.3)+
scale_color_manual(values = c("#DC143C", "#1E90FF", "#D2691E", "#C0C0C0"))+
geom_label_repel(aes(label = Gene_name), show.legend = F)
} else {
volcplot <- diff_rep_2 %>%
ggplot(aes(x = FC, y = -log10(pval)))+
geom_point(aes(col = regulation), size = 3, alpha = 0.4)+
labs(x = paste0("log2(",group1, "/",group2, ")"),
y = '-log10(P-Value)',
title = paste0("Volcano Plot: ", group1, "/", group2),
fill = "Regulation")+
theme_light()+
geom_vline(xintercept = 1)+
geom_vline(xintercept = -1)+
geom_hline(yintercept = 1.3)+
scale_color_manual(values = c("#DC143C", "#1E90FF", "#C0C0C0"))+
geom_label_repel(aes(label = Gene_name), show.legend = F)
}
return(list(volcano = volcplot,
data = diff_rep_2))
}Volcano plot
Disease_control vs Healthy_control
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "disease_control",
group2 = "healthy_control", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1742 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Disease_control vs Healthy_control",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Healthy_mimic vs Healthy_control
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "healthy_mimic",
group2 = "healthy_control", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1737 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Healthy_mimic vs Healthy_control",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Healthy_antag vs Healthy_control
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "healthy_antag",
group2 = "healthy_control", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1739 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Healthy_antag vs Healthy_control",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Disease_mimic vs Disease_control
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "disease_mimic",
group2 = "disease_control", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1737 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Disease_mimic vs Disease_control",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Disease_antag vs Disease_control
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "disease_antag",
group2 = "disease_control", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1743 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Disease_antag vs Disease_control",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Disease_mimic vs Healthy_mimic
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "disease_mimic",
group2 = "healthy_mimic", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1725 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Disease_mimic vs Healthy_mimic",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)Disease_antag vs Healthy_antag
thefunctionthing <- make_volcano(dataframe = diff_rep, group1 = "disease_antag",
group2 = "healthy_antag", fold_change = 1, pvalue = 0.05, mappings_df = mappings)
thefunctionthing$volcano## Warning: ggrepel: 1738 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
datatable(thefunctionthing$data,
caption = "Disease_mimic vs Healthy_mimic",
options = list(scrollX = TRUE,
scrollY = TRUE),
rownames = F)